Summary Note

Our main goal: to figure out what accounts the participants are following

Problem: Reverse chronological endpoint is capped (limited to 10k tweets/month)

To find out: How much we can spend, how many tweets we need per participant 

To get a sense of this, we are analyzing Rockwell pilot data and trying to estimate…

Strategy:


What changes have been made?

  1. As Brendan suggested in the last meeting, a new way of calculating the y-axis (= fraction of distinct accounts appearing in the collected tweets) has been added.
    • old: # of distinct accounts appearing in tweets / total # of distinct friends that each user has.
    • new: # of distinct accounts appearing in tweets / total # of distinct accounts that each user sees.
  2. In the last meeting, the x-axis in all plots was the # of tweets collected. The problem is that this x-axis is not comparable across users because they follow different numbers of accounts: users who follow many accounts naturally receive more tweets, while users who follow only a few receive few. Since the y-axis is in relative terms (a fraction), a rescaled x-axis has also been added.
    • Rescaling method: divide the x-axis (# of tweets collected) by each user's average number of tweets per second.
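
As a hedged illustration of the two changes above, the sketch below computes both y-axis definitions and the rescaled x-axis for one made-up user. The account labels, the friends count of 5, and the 40-second time span are invented for the example; only the formulas follow the definitions in this note.

```r
# Toy timeline for ONE hypothetical user: the account behind each collected
# tweet, in the order the tweets were collected.
accounts <- c("a", "b", "a", "c", "b", "d", "a", "d")

friends_count <- 5    # assumed: number of accounts the user follows
span_seconds  <- 40   # assumed: seconds between the first and last tweet

n_collected <- seq_along(accounts)             # old x-axis: tweets collected so far
n_distinct  <- cumsum(!duplicated(accounts))   # distinct accounts seen so far

old_y <- n_distinct / friends_count            # old: relative to all friends
new_y <- n_distinct / max(n_distinct)          # new: relative to accounts ever seen

tweets_per_sec <- length(accounts) / span_seconds   # average tweets per second
new_x <- n_collected / tweets_per_sec               # rescaled x-axis

toy <- data.frame(n_collected, n_distinct, old_y, new_y, new_x)
print(toy)
```

In this toy case old_y tops out at 0.8 because one of the five assumed friends never appears, while new_y reaches 1 by construction. Note that new_x runs from 5 to 40, so the rescaled axis is effectively measured in seconds.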

To ease comparison, the old and new versions of each plot are drawn side by side. See notes on each plot.


Load packages

library(readr)
library(tidyverse)
library(ggplot2)
library(ggthemes)
library(grid)
library(gridExtra)
library(DT)
library(lubridate)
library(scales)

Prepare data

# Load data
# The data is cleaned and exported using Python.  
# Python code for making dataframe for R is provided upon request. 
df <- read_csv("df.csv")

# Define data type
df %>%
  mutate(
    user_id = as.factor(user_id),
    tweet_id = as.factor(tweet_id),
    friend_id = as.factor(account_id)
  ) %>%
  dplyr::select(-account_id) -> df


# Making new variables
# (1) Maximum number of each user's friends: max_friends_count 
# What is the maximum of user_friends_count? 
df %>% 
  dplyr::select(user_id, user_friends_count) %>% 
  distinct() %>% 
  group_by(user_id) %>% 
  mutate(max_friends_count = max(user_friends_count)) %>% 
  dplyr::select(-user_friends_count) %>% 
  distinct() -> max_data
# max_data: a new dataframe with `user_id` and `max_friends_count` as variables  

# Merge this 'max_data' into df 
df %>% 
  merge(max_data, by="user_id") -> df

# (2) Timestamp data of each tweet: tweet_timestamp 
# Clean the timestamp strings so that the lubridate package can parse them

# Split each timestamp string into its space-separated components
list_timestamp <- str_split(df$tweet_timestamp, " ")

# Reassemble components 3 (day), 2 (month), 6 (year), and 4 (time)
# into a single "day month year time" string per tweet
dmyt <- vapply(
  list_timestamp,
  function(parts) paste(parts[3], parts[2], parts[6], parts[4]),
  character(1)
)

# Parse the strings and overwrite 'tweet_timestamp' with the result
df$tweet_timestamp <- dmy_hms(dmyt)  # timestamp format: day-month-year-hour-minute-second

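# (3) Define x-axis: number of tweets collected
# Note: count()/summarise() return rows sorted by the grouping keys, not by an
# earlier arrange(), so the time ordering is applied after the per-tweet counts.
df %>%
  group_by(user_id, tweet_id) %>%
  summarise(
    n = n(),                            # rows per tweet
    first_seen = min(tweet_timestamp),  # timestamp used to order the tweets
    .groups = "drop"
  ) %>%
  arrange(user_id, first_seen) %>%      # order each user's tweets by time
  group_by(user_id) %>%
  mutate(
    old_x = cumsum(n),  # old_x: number of tweets collected so far
    max_n_tweets = max(old_x)
  ) %>%
  ungroup() %>%
  dplyr::select(
    user_id, tweet_id, old_x, max_n_tweets
  ) -> df_for_x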

df %>%
  inner_join(df_for_x, by=c("user_id", "tweet_id")) %>%
  arrange(user_id, tweet_timestamp) -> df 

#* [X] Re-scaling → divide x axis by the average tweets per second of each participant. 
#* For each participant, (1) take the first and last tweet in the data and compute the number of seconds between them, and then (2) divide the total number of tweets seen for the participant by the number of seconds.

df |> 
  group_by(user_id) |> 
  summarise(
    # Ask for seconds explicitly: a bare `max - min` lets R choose the unit
    # (secs, mins, hours, or days), which as.numeric() would then silently drop.
    timediff = as.numeric(difftime(max(tweet_timestamp), min(tweet_timestamp), units = "secs"))
  ) -> timeDiff

df |> 
  merge(timeDiff, by="user_id") |>
  mutate(
    avg_n_tweets_persec = max_n_tweets / timediff,  # average number of tweets per second for each user
    new_x = old_x / avg_n_tweets_persec  # new_x: number of tweets collected so far divided by the average number of tweets per second
  ) -> df2

# (4) Define y-axis: count how many distinct accounts are in the tweets (numerator) 
#* [Y] Re-scaling → make a fraction for the y-axis (new denominator, as Brendan suggested: the maximum number of distinct accounts each user "sees" (not "has"), so that all 60 individuals reach 1 in the end (individual plots))

df2 %>% 
  arrange(user_id, tweet_timestamp) %>%
  group_by(user_id) %>%
  mutate(
    numerator = cumsum(!duplicated(friend_id)),
    old_y = numerator / max_friends_count, 
    new_y =  numerator / max(numerator) 
  ) -> df2 

# df2 is the final data for drawing plots

Changes made to the x- and y-axes are described in this table.

| Variable | Definition |
|----------|------------|
| old_x | Number of tweets collected so far |
| new_x | Number of tweets collected so far / Average number of tweets per second |
| old_y | Fraction of distinct accounts that have appeared in tweets over the maximum number of friends each user has (= How many distinct accounts, among all the friends that each user has, have appeared in the tweets pulled so far?) |
| new_y | Fraction of distinct accounts that have appeared in tweets over the maximum number of distinct accounts each user sees across all collected tweets (thus everyone reaches 1 at the end) |
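
Because new_x divides by a per-second rate, the time span between a participant's first and last tweet has to be expressed in seconds. The base-R sketch below (the two timestamps are made up) shows why it is worth requesting the unit explicitly: subtracting two date-times lets R choose the unit, and as.numeric() then drops it without warning.

```r
# Two made-up timestamps exactly two days apart
first_tweet <- as.POSIXct("2022-01-01 00:00:00", tz = "UTC")
last_tweet  <- as.POSIXct("2022-01-03 00:00:00", tz = "UTC")

auto_span <- last_tweet - first_tweet   # R picks the unit automatically
units(auto_span)                        # "days"
as.numeric(auto_span)                   # 2, i.e. days, not seconds

# Requesting seconds explicitly removes the ambiguity
span_secs <- as.numeric(difftime(last_tweet, first_tweet, units = "secs"))
span_secs                               # 172800
```

With a span reported in days, a "tweets per second" rate would be off by a factor of 86,400, so fixing the unit keeps new_x comparable across users.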

Plot 1.

Plots 1a-1d are scatter plots displaying patterns of change in the fraction of distinct accounts as we pull tweets.

df2 %>%
  group_by(user_id) %>%
  ggplot(aes(x=old_x, y=old_y, col=user_id)) +
  geom_point(alpha=0.5) +
  theme_few() + 
  theme(legend.position="none") +
  xlab("Number of Tweets Collected") +
  ylab("Fraction of Distinct Accounts (%)") +
  scale_x_continuous(n.breaks = 10, limits=c(0, 120000),
                     labels = label_number(scale_cut = cut_short_scale())) +
  ggtitle("Plot 1a", subtitle = "old x (# of tweets), \nold y (# distinct accounts/max friends count)") -> plot1a

df2 %>%
  group_by(user_id) %>%
  ggplot(aes(x=old_x, y=new_y, col=user_id)) +
  geom_point(alpha=0.5) +
  theme_few() + 
  theme(legend.position="none") +
  xlab("# of Tweets Collected") +
  ylab("Fraction of Distinct Accounts (%)") +
  scale_x_continuous(n.breaks = 10, limits=c(0, 120000),
                     labels = label_number(scale_cut = cut_short_scale())) +
  ggtitle("Plot 1b", subtitle = "old x (# of tweets), \nnew y (# distinct accounts/max distinct accounts seen)") -> plot1b

df2 %>%
  group_by(user_id) %>%
  ggplot(aes(x=new_x, y=old_y, col=user_id)) +
  geom_point(alpha=0.5) +
  theme_few() + 
  theme(legend.position="none") +
  xlab("# of Tweets Collected / Avg # of Tweets per sec") +
  ylab("Fraction of Distinct Accounts (%)") +
  scale_x_continuous(n.breaks = 10, limits = c(0, 30000000),
                     labels = label_number(scale_cut = cut_short_scale())) +
  ggtitle("Plot 1c", subtitle = "new x (rescaled # of tweets), \nold y (# distinct accounts/max friends count)") -> plot1c

df2 %>%
  group_by(user_id) %>%
  ggplot(aes(x=new_x, y=new_y, col=user_id)) +
  geom_point(alpha=0.5) +
  theme_few() + 
  theme(legend.position="none") +
  xlab("# of Tweets Collected / Avg # of Tweets per sec") +
  ylab("Fraction of Distinct Accounts (%)") +
  scale_x_continuous(n.breaks = 10, limits = c(0, 30000000),
                     labels = label_number(scale_cut = cut_short_scale())) +
  ggtitle("Plot 1d", subtitle = "new x (rescaled # of tweets), \nnew y (# distinct accounts/max distinct accounts seen)") -> plot1d


grid.arrange(plot1a, plot1b, plot1c, plot1d, nrow=2)

Plot 2.

What happens when we set the x-axis (and y-axis) to common logarithmic scales?

plot1a +
  xlab("Log(Number of Tweets Collected)") +
  scale_x_log10(n.breaks=10, 
                labels = scales::label_log()) +
  ggtitle("Plot 2a", subtitle = "Logged old x (# of tweets), \nold y (# distinct accounts/max friends count)") -> plot2a

plot1b +
  xlab("Log(Number of Tweets Collected)") +
  scale_x_log10(n.breaks=10, 
                labels = scales::label_log()) +
  ggtitle("Plot 2b", subtitle = "Logged old x (# of tweets), \nnew y (# distinct accounts/max distinct accounts seen)") -> plot2b

plot1c +
  xlab("Log(# of Tweets Collected / Avg # of Tweets per sec)") +
  scale_x_log10(n.breaks=10, 
                labels = scales::label_log()) +
  ggtitle("Plot 2c", subtitle = "Logged new x (rescaled # of tweets), \nold y (# distinct accounts/max friends count)") -> plot2c

plot1d + 
  xlab("Log(# of Tweets Collected / Avg # of Tweets per sec)") +
  scale_x_log10(n.breaks=10, 
                 labels = scales::label_log()) +
  ggtitle("Plot 2d", subtitle = "Logged new x (rescaled # of tweets), \nnew y (# distinct accounts/max distinct accounts seen)") -> plot2d


plot2c +
  scale_y_log10(n.breaks=10,  labels = scales::label_log()) +
  ggtitle("Plot 2e", subtitle = "Logged new x (rescaled # of tweets), \nLogged old y (# distinct accounts/max friends count)") -> plot2e

plot2d +
  scale_y_log10(n.breaks=10,  labels = scales::label_log()) +
  ggtitle("Plot 2f", subtitle = "Logged new x (rescaled # of tweets), \nLogged new y (# distinct accounts/max distinct accounts seen)") -> plot2f

grid.arrange(plot2a, plot2b, plot2c, plot2d, plot2e, plot2f, ncol=2)

Plot 3. Aggregate Plots

What if we aggregate users by taking the mean of fraction of distinct accounts (=y) at each point of the tweets collected (=x)?

For plots with the rescaled x-axis (= # of tweets/avg # of tweets per sec), I binned the x-axis and then calculated a weighted average of y within each bin.
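
To make the binning step concrete, here is a small sketch with made-up x values and a bin width of 10 instead of the 100,000 used in this note. It mirrors the cut()/pretty() combination and shows how an integer bin number maps back to a bin range; the variable names are illustrative only.

```r
# Made-up rescaled x values (the real new_x spans millions of seconds)
x <- c(3, 8, 12, 19, 27, 31, 44, 48)

# Ask pretty() for roughly (range / bin width) intervals
bin_width <- 10
breaks <- pretty(x, n = (max(x) - min(x)) / bin_width)

# Assign each value to a bin; the bin number is the position of its factor level
bins   <- cut(x, breaks = breaks, include.lowest = TRUE)
bin_no <- as.integer(bins)

print(data.frame(x, bins, bin_no))

# Lookup table from bin number to bin range
print(data.frame(bin_no = seq_along(levels(bins)), range = levels(bins)))
```

pretty() treats n as a suggestion (non-integer values are rounded down, and break points are snapped to round numbers), so the realized bin width and count should be read off the resulting levels rather than assumed.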

# old_x, old_y
df2 %>%
  group_by(old_x) %>%
  summarize(y = mean(old_y)) %>%
  ungroup() %>%
  ggplot(aes(x=old_x, y=y)) +
  geom_point(alpha=0.5) + 
  geom_smooth(color='darkcyan', linewidth=1) + 
  theme_few() + 
  theme(legend.position="none") +
  xlab("# of Tweets Collected") +
  ylab("Mean Fraction of Distinct Accounts (%)") +
  scale_x_continuous(n.breaks = 10, limits = c(0, 120000),
                     labels = label_number(scale_cut = cut_short_scale())) + 
  ylim(c(0, 1)) + 
  ggtitle("Plot 3a", subtitle = "old x (# of tweets),\nold y (# distinct accounts/max friends count)") -> plot3a

# old_x, new_y
df2 %>%
  group_by(old_x) %>%
  summarize(y = mean(new_y)) %>%
  ungroup() %>%
  ggplot(aes(x=old_x, y=y)) +
  geom_point(alpha=0.5) + 
  geom_smooth(color='darkcyan', linewidth=1) + 
  theme_few() + 
  theme(legend.position="none") +
  xlab("# of Tweets Collected") +
  ylab("Mean Fraction of Distinct Accounts (%)") +
  scale_x_continuous(n.breaks = 10, limits = c(0, 120000),
                     labels = label_number(scale_cut = cut_short_scale())) + 
   ylim(c(0, 1)) + 
  ggtitle("Plot 3b", subtitle = "old x (# of tweets), \nnew y (# distinct accounts/max distinct accounts seen)") -> plot3b

# make a new dataframe with binned x-axis and weighted mean 
df2 %>% 
  mutate(
    bins = cut(new_x , 
               breaks = pretty(new_x, n = (max(new_x)-min(new_x))/100000),  # 1057 levels 
               include.lowest = TRUE)) %>% 
  group_by(user_id, bins) %>% 
  mutate(weights = n()) %>% 
  ungroup() %>% 
  group_by(bins) %>% 
  summarise(old_y_weighted = weighted.mean(old_y, weights),
         new_y_weighted = weighted.mean(new_y, weights)) %>%
  ungroup() -> df3

# new_x, old_y 
df3 %>%
  mutate(bins_x = as.integer(bins)) %>% 
  ggplot(aes(x=bins_x, y=old_y_weighted)) +
  geom_point(alpha=0.5) + 
  theme_few() + 
  xlab("Bins of [# of Tweets Collected / Avg # of Tweets per sec]") +
  ylab("Mean Fraction of Distinct Accounts (%)") +
  scale_x_continuous(n.breaks = 10,  limits=c(0, 1000),
                     labels = label_number()) + 
   ylim(c(0, 1)) + 
  ggtitle("Plot 3c", subtitle = "Binned new x (rescaled # of tweets), \nold y (# distinct accounts/max friends count)") -> plot3c
 
# new_x, new_y
df3 %>%
  mutate(bins_x = as.integer(bins)) %>% 
  ggplot(aes(x=bins_x, y=new_y_weighted)) +
  geom_point(alpha=0.5) + 
  theme_few() + 
  xlab("Bins of [# of Tweets Collected / Avg # of Tweets per sec]") +
  ylab("Mean Fraction of Distinct Accounts (%)") +
  scale_x_continuous(n.breaks = 10,  limits=c(0, 1000),
                     labels = label_number()) + 
   ylim(c(0, 1)) + 
  ggtitle("Plot 3d", subtitle = "Binned new x (rescaled # of tweets), \nnew y (# distinct accounts/max distinct accounts seen)") -> plot3d

# Let's zoom in on plot3c and plot3d:
df3 %>%
  mutate(bins_x = as.integer(bins)) %>% 
  filter(bins_x < 160) %>%
  ggplot(aes(x=bins_x, y=old_y_weighted)) +
  geom_point(alpha=0.5) + 
  theme_few() + 
  xlab("Bins of [# of Tweets Collected / Avg # of Tweets per sec]") +
  ylab("Mean Fraction of Distinct Accounts (%)") +
  scale_x_continuous(n.breaks = 10,  limits=c(0, 160),
                     labels = label_number()) + 
   ylim(c(0, 1)) + 
  ggtitle("Plot 3c | Zoomed In", subtitle = "Binned new x (rescaled # of tweets), \nold y (# distinct accounts/max friends count)") +
  geom_vline(xintercept=153, lty=2, color="darkcyan") +
  geom_vline(xintercept=74, lty=2, color="darkcyan") -> plot3c_zoom

df3 %>%
  mutate(bins_x = as.integer(bins)) %>% 
  filter(bins_x < 160) %>%
  ggplot(aes(x=bins_x, y=new_y_weighted)) +
  geom_point(alpha=0.5) + 
  theme_few() + 
  xlab("Bins of [# of Tweets Collected / Avg # of Tweets per sec]") +
  ylab("Mean Fraction of Distinct Accounts (%)") +
  scale_x_continuous(n.breaks = 10,  limits=c(0, 160),
                     labels = label_number()) + 
   ylim(c(0, 1)) + 
  ggtitle("Plot 3d | Zoomed In", subtitle = "Binned new x (rescaled # of tweets), \nnew y (# distinct accounts/max distinct accounts seen)") +
  geom_vline(xintercept=153, lty=2, color="darkcyan") +    
  geom_vline(xintercept=74, lty=2, color="darkcyan") -> plot3d_zoom


grid.arrange(plot3a, plot3b, plot3c, plot3d, plot3c_zoom, plot3d_zoom, ncol=2)

df3 %>%
  mutate(bins_x = as.integer(bins)) %>% 
  dplyr::select(bins_x, bins) %>%
  unique() -> table_bin

datatable(table_bin, 
          caption = "Bin No. & Bin Range",
          filter="top")

Or this version of aggregate plots?

df2 %>% 
  group_by(old_x) %>% 
  summarize(y_old = mean(old_y), y_new = mean(new_y)) %>% 
  ungroup() %>% 
  pivot_longer(cols = c("y_old","y_new")) %>% 
  ggplot(aes(x=old_x, y=value, col=name)) + 
  geom_point(alpha=0.5) + 
  theme_few() + 
  theme(legend.position="none") +
  xlab("# of Tweets Collected") +
  ylab("Mean Fraction of Distinct Accounts (%)") +
  scale_x_continuous(n.breaks = 10, limits = c(0, 120000),
                     labels = label_number(scale_cut = cut_short_scale())) + 
   ylim(c(0, 1)) + 
  ggtitle("Plot 3ab: old x (# of tweets)", subtitle = "Pink: mean of new y \nBlue: mean of old y") -> plot3ab


df3 %>%
  mutate(bins_x = as.integer(bins)) %>% 
  pivot_longer(cols=c("old_y_weighted", "new_y_weighted")) %>% 
  ggplot(aes(x=bins_x, y=value, col=name)) +
  geom_jitter(alpha=0.7, width=0.5, height=0.005) + 
  theme_few() + 
  theme(legend.position="none") +
  xlab("Bins of [# of Tweets Collected / Avg # of Tweets per sec]") +
  ylab("Mean Fraction of Distinct Accounts (%)") +
  scale_x_continuous(n.breaks = 10,  limits=c(0, 1000),
                     labels = label_number()) +  
  ylim(c(0, 1)) + 
  ggtitle("Plot 3cd: Binned new x (rescaled # of tweets)", subtitle = "Pink: weighted mean of new y  \nBlue: weighted mean of old y") +
  geom_vline(xintercept=153, lty=2, color="darkcyan") +    
  geom_vline(xintercept=74, lty=2, color="darkcyan") -> plot3cd

grid.arrange(plot3ab, plot3cd)

Plot 4. Aggregate Plots with Logarithmic Scales

What happens when we set the x-axis (and y-axis) to common logarithmic scales and replicate plots 3a~3d?

plot3a +
  xlab("Log(Number of Tweets Collected)") +
  scale_x_log10(n.breaks=10,
                labels = scales::label_log()) +
  ggtitle("Plot 4a", subtitle = "Logged old x (# of tweets), \nold y (# distinct accounts/max friends count)") -> plot4a

plot3b +
  xlab("Log(Number of Tweets Collected)") +
  scale_x_log10(n.breaks=10, 
                labels = scales::label_log()) +
  ggtitle("Plot 4b", subtitle = "Logged old x (# of tweets), \nnew y (# distinct accounts/max distinct accounts seen)") -> plot4b

plot3c +
  xlab("Log( Bins of [# of Tweets Collected / Avg # of Tweets per sec] )") +
  scale_x_log10(n.breaks=10, 
                labels = scales::label_log()) +
  ggtitle("Plot 4c", subtitle = "Logged & Binned new x (rescaled # of tweets), \nold y (# distinct accounts/max friends count)") -> plot4c

plot3d + 
  xlab("Log( Bins of [# of Tweets Collected / Avg # of Tweets per sec] )") +
  scale_x_log10(n.breaks=10, 
                 labels = scales::label_log()) +
  ggtitle("Plot 4d", subtitle = "Logged & Binned new x (rescaled # of tweets), \nnew y (# distinct accounts/max distinct accounts seen)") -> plot4d

plot4c +
  scale_y_log10(n.breaks=10,  labels = scales::label_log()) +
  ggtitle("Plot 4e", subtitle = "Logged & Binned new x (rescaled # of tweets), \nLogged old y (# distinct accounts/max friends count)") -> plot4e

plot4d +
  scale_y_log10(n.breaks=10,  labels = scales::label_log()) +
  ggtitle("Plot 4f", subtitle = "Logged & Binned new x (rescaled # of tweets), \nLogged new y (# distinct accounts/max distinct accounts seen)") -> plot4f

grid.arrange(plot4a, plot4b, plot4c, plot4d, plot4e, plot4f, ncol=2)

Or … like this?

plot3ab + 
  xlab("Log(# of Tweets Collected)") +
  scale_x_log10(n.breaks=10,
                labels = scales::label_log()) +
  ggtitle("Plot 4ab: Logged old x (# of tweets)", subtitle = "Pink: mean of new y \nBlue: mean of old y") -> plot4ab

df3 %>%
  mutate(bins_x = as.integer(bins)) %>% 
  pivot_longer(cols=c("old_y_weighted", "new_y_weighted")) %>% 
  ggplot(aes(x=bins_x, y=value, col=name)) +
  geom_jitter(alpha=0.7, height=0.005) + 
  theme_few() + 
  theme(legend.position="none") +
  ylab("Mean Fraction of Distinct Accounts (%)") +
  ylim(c(0, 1)) + 
  xlab("Log(Bins of [# of Tweets Collected / Avg # of Tweets per sec])") +
  scale_x_log10(n.breaks=10,
                labels = scales::label_log()) +
  ggtitle("Plot 4cd: Logged & Binned new x (rescaled # of tweets)", subtitle = "Pink: weighted mean of new y  \nBlue: weighted mean of old y") -> plot4cd

plot4ab + 
  ylab("Log(Mean Fraction of Distinct Accounts (%))") + 
  scale_y_log10(n.breaks=10,  labels = scales::label_log()) +
  ggtitle("Plot 4ab_2: Logged old x (# of tweets)", subtitle = "Pink: Logged mean of new y  \nBlue: Logged mean of old y") -> plot4ab_2

plot4cd + 
  ylab("Log(Mean Fraction of Distinct Accounts (%))") + 
  scale_y_log10(n.breaks=10,  labels = scales::label_log()) +
  ggtitle("Plot 4ef: Logged & Binned new x (rescaled # of tweets)", subtitle = "Pink: Logged weighted mean of new y  \nBlue: Logged weighted mean of old y") -> plot4ef

grid.arrange(plot4ab, plot4cd, plot4ab_2, plot4ef)

Plot 5. Grid by Individual User

In the data frame, there are 60 unique users. Let’s redraw some of the plots by each individual user. I allowed scales of the x-axis to vary for each user.

df2 %>% 
  ggplot(aes(x=old_x, y=old_y, col=user_id)) +
  geom_point(alpha=0.5) +
  theme_few() + 
  theme(legend.position="none") +
  xlab("# of Tweets Collected") +
  ylab("Fraction of Distinct Accounts Appearing in Tweets (%)") +
  scale_x_continuous(n.breaks = 5, 
                     labels = label_number(scale_cut = cut_short_scale())) +
  facet_wrap(~user_id, scales="free_x", ncol = 10) +
  ggtitle("Plot 5a", subtitle = "old x (# of tweets), \nold y (# distinct accounts/max friends count)") 

df2 %>% 
  ggplot(aes(x=old_x, y=new_y, col=user_id)) +
  geom_point(alpha=0.5) +
  theme_few() + 
  theme(legend.position="none") +
  xlab("# of Tweets Collected") +
  ylab("Fraction of Distinct Accounts Appearing in Tweets (%)") +
  scale_x_continuous(n.breaks = 5, 
                     labels = label_number(scale_cut = cut_short_scale())) +
  facet_wrap(~user_id, scales="free_x", ncol = 10) +
  ggtitle("Plot 5b", subtitle = "old x (# of tweets), \nnew y (# distinct accounts/max distinct accounts seen)") 

df2 %>% 
  ggplot(aes(x=new_x, y=old_y, col=user_id)) +
  geom_point(alpha=0.5) +
  theme_few() + 
  theme(legend.position="none") +
  xlab("# of Tweets Collected / Avg # of Tweets per sec") +
  ylab("Fraction of Distinct Accounts Appearing in Tweets (%)") +
  scale_x_continuous(n.breaks = 5, 
                     labels = label_number(scale_cut = cut_short_scale())) +
  facet_wrap(~user_id, scales="free_x", ncol = 10) +
  ggtitle("Plot 5c", subtitle = "new x (rescaled # of tweets), \nold y (# distinct accounts/max friends count)") 

df2 %>% 
  ggplot(aes(x=new_x, y=new_y, col=user_id)) +
  geom_point(alpha=0.5) +
  theme_few() + 
  theme(legend.position="none") +
  xlab("# of Tweets Collected / Avg # of Tweets per sec") +
  ylab("Fraction of Distinct Accounts Appearing in Tweets (%)") +
  scale_x_continuous(n.breaks = 5, 
                     labels = label_number(scale_cut = cut_short_scale())) +
  facet_wrap(~user_id, scales="free_x", ncol = 10) +
  ggtitle("Plot 5d", subtitle = "new x (rescaled # of tweets), \nnew y (# distinct accounts/max distinct accounts seen)") 

[Outliers] Distribution of Friends Count & Max Accounts Seen

It seems some people follow very few accounts while others follow a great many. Let’s check the distribution of friends counts as well as the maximum number of accounts observed in the collected tweets.

df2 %>%
  group_by(user_id) %>% 
  mutate(max_accounts_seen = max(numerator)) %>%
  distinct(user_id, max_friends_count, max_accounts_seen) %>% 
  arrange(max_friends_count) -> table_dta  # ascending order of friends count

datatable(table_dta, filter="top")

Let’s remove users whose max_accounts_seen is less than 10 or more than 1,000, and re-draw the aggregate plots.

Plot 6. Aggregate Plots Without Outliers

df2 %>%
  group_by(user_id) %>% 
  mutate(max_accounts_seen = max(numerator)) %>%
  filter(max_accounts_seen >= 10 & max_accounts_seen <= 1000) %>%
  ungroup() %>% 
  group_by(old_x) %>%
  summarize(y_old = mean(old_y), y_new = mean(new_y)) %>% 
  ungroup() %>% 
  pivot_longer(cols = c("y_old","y_new")) %>% 
  ggplot(aes(x=old_x, y=value, col=name)) + 
  geom_point(alpha=0.5) + 
  theme_few() + 
  theme(legend.position="none") +
  xlab("# of Tweets Collected") +
  ylab("Mean Fraction of Distinct Accounts (%)") +
  scale_x_continuous(n.breaks = 10, limits = c(0, 45000),
                     labels = label_number(scale_cut = cut_short_scale())) + 
  ggtitle("Plot 6ab (w/o outliers): old x (# of tweets)", subtitle = "Pink: mean of new y \nBlue: mean of old y") -> plot6ab   

df2 %>% 
  mutate(
    bins = cut(new_x , 
               breaks = pretty(new_x, n = (max(new_x)-min(new_x))/100000),  # 1057 levels 
               include.lowest = TRUE)) %>% 
  group_by(user_id, bins) %>% 
  mutate(weights = n()) %>% 
  ungroup() %>%
  group_by(user_id) %>% 
  mutate(max_accounts_seen = max(numerator)) %>%
  filter(max_accounts_seen >= 10 & max_accounts_seen <= 1000) %>%
  ungroup() %>% 
  group_by(bins) %>% 
  summarise(old_y_weighted = weighted.mean(old_y, weights),
         new_y_weighted = weighted.mean(new_y, weights)) %>%
  ungroup() -> df4

df4 %>%
  mutate(bins_x = as.integer(bins)) %>% 
  pivot_longer(cols=c("old_y_weighted", "new_y_weighted")) %>% 
  ggplot(aes(x=bins_x, y=value, col=name)) +
  geom_jitter(alpha=1, width=0.5, height=0.005) + 
  theme_few() + 
  theme(legend.position="none") +
  xlab("Bins of [# of Tweets Collected / Avg # of Tweets per sec]") +
  ylab("Mean Fraction of Distinct Accounts (%)") +
  scale_x_continuous(n.breaks = 10,  limits=c(0, 80),
                     labels = label_number()) + 
  ggtitle("Plot 6cd (w/o outliers): new x (rescaled # of tweets)", subtitle = "Pink: weighted mean of new y  \nBlue: weighted mean of old y") +
  geom_vline(xintercept=29, lty=2, color="darkcyan") +    
  geom_vline(xintercept=77, lty=2, color="darkcyan") -> plot6cd

# bin no. 29: (2,800,000 , 2,900,000] 
# bin no. 77: (7,300,000 , 7,400,000]


grid.arrange(plot6ab, plot6cd)


Questions/Notes for Giovanni
  1. Double-check whether I did the binning & weighted average calculation right:
    • First, I binned the rescaled x-axis into 1057 levels in total.
    • Second, for each user, I counted that user's observations within each bin (= weights).
    • Third, for each bin, I calculated the weighted mean of y, weighting by the counts from the second step.
    • Therefore, each bin ends up with a single weighted mean of y.
  2. In the aggregate plots with the rescaled x-axis (binned & weighted averaged), the scatter plots are discontinuous (e.g. Plots 4cd, 4ef, 6cd).
    • Is there something going on here? Or is it just a meaningless pattern?
  3. Back to the main goal of all these pilot analyses … deciding when to stop pulling:
    • Which plot should we base our final decision on?
    • After we narrow it down to a few plots … I will redraw them with the # of distinct low-quality sources on the y-axis!
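
Regarding question 1, a toy example (all numbers invented) may help check what the row-level weighting does. Within a single bin, weighting every row by its user's row count means a user with k rows contributes a total weight of k × k, so heavy tweeters count for more than in a plain pooled mean. The sketch compares three candidate aggregates; which one is intended is a judgment call for the analysis, not something this example settles.

```r
# One bin, two hypothetical users: u1 has 4 rows in the bin, u2 has 1 row
bin <- data.frame(
  user_id = c("u1", "u1", "u1", "u1", "u2"),
  y       = c(0.2, 0.2, 0.2, 0.2, 0.9)
)

# Weight each row by the number of rows its user has in the bin
# (the per-user, per-bin count described in the second step)
bin$weights <- ave(seq_along(bin$user_id), bin$user_id, FUN = length)

# (a) Row-level weighted mean: u1 carries total weight 4 * 4 = 16, u2 carries 1
row_weighted <- weighted.mean(bin$y, bin$weights)

# (b) Pooled mean: every row counts once
pooled_mean <- mean(bin$y)

# (c) Mean of per-user means: every user counts once
user_averaged <- mean(tapply(bin$y, bin$user_id, mean))

print(c(row_weighted = row_weighted,
        pooled_mean = pooled_mean,
        user_averaged = user_averaged))
```

Here the three aggregates come out at roughly 0.241, 0.34, and 0.55, which shows how strongly the choice of weighting can move the per-bin value when users differ in activity.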